The purpose of this case study is to classify a given silhouette as one of four types of vehicle, using a set of features extracted from the silhouette. The vehicle may be viewed from one of many different angles. Four "Corgi" model vehicles were used for the experiment: a double-decker bus, a Chevrolet van, a Saab 9000 and an Opel Manta 400. This particular combination of vehicles was chosen with the expectation that the bus, the van and either one of the cars would be readily distinguishable, but that it would be more difficult to distinguish between the two cars.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
#Load the data
data_df = pd.read_csv('vehicle.csv')
data_df.head(10)
features = data_df.columns.tolist()[0:-1]
target = data_df.columns.tolist()[-1]
print("feature lists--> %s \n\n\ntarget--> %s"%(features,target))
data_df.shape
# We can see that there are 846 rows and 19 columns (including the target column)
data_df.info()
From the output we notice that missing (NaN) values are present for some of the features
#Lets check for NaN or null values
data_df.isna().any()
#Lets count the NaNs
data_df.isna().sum(axis=0)
# Fill NaNs with the column median (numeric columns only)
data_df.fillna(data_df.median(numeric_only=True), inplace=True)
#Lets count the NaNs
data_df.isna().sum(axis=0)
# Great, all null values are taken care of now
data_df.describe().transpose()
data_df['class'].value_counts()
From the output above we can see how the records are distributed across the vehicle classes, and whether the classes are balanced
# Lets check outliers using boxplot
data_df.boxplot(figsize=(20,5),rot=50)
From the graph we can see that several features have points beyond the whiskers, i.e. outliers
# Lets take care of outliers now
# By the empirical (68-95-99.7) rule, values within 3 standard deviations of the mean account for about 99.7% of a normally distributed data set
# Lets use this approach to knock out data that falls below or above the 3-std cutoff (we could also replace outliers with median values)
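The 99.7% figure quoted above can be sanity-checked numerically. A minimal sketch on synthetic standard-normal data (the sample size and seed are arbitrary choices for illustration):

```python
import numpy as np

# Draw a large standard-normal sample and measure how much of it
# lies within 3 standard deviations of the mean.
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

within_3_std = np.mean(np.abs(sample) < 3)
print(f"fraction within 3 std: {within_3_std:.4f}")  # close to 0.997
```

So a 3-std cutoff only flags roughly 0.3% of normally distributed data as outliers, which is why it is a conservative choice here.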
# make copy of original data set
newdata = data_df.copy()
newdata.head(3)
# Taking care of radius_ratio outliers
data_mean = newdata['radius_ratio'].mean()
cutoff = newdata['radius_ratio'].std()*3
lower ,upper = data_mean - cutoff , data_mean + cutoff
lower,upper
newdata[((newdata['radius_ratio'] > upper) | (newdata['radius_ratio'] < lower)) ]
#So these are the outliers for radius_ratio which we have to remove
#Remove the outliers indexes
newdata.drop(newdata[((newdata['radius_ratio'] > upper) | (newdata['radius_ratio'] < lower)) ].index,inplace=True)
# Taking care of pr.axis_aspect_ratio outliers
data_mean = newdata['pr.axis_aspect_ratio'].mean()
cutoff = newdata['pr.axis_aspect_ratio'].std()*3
lower ,upper = data_mean - cutoff , data_mean + cutoff
lower,upper
newdata[((newdata['pr.axis_aspect_ratio'] > upper) | (newdata['pr.axis_aspect_ratio'] < lower)) ]
#So these are the outliers for pr.axis_aspect_ratio which we have to remove
#Remove the outliers indexes
newdata.drop(newdata[((newdata['pr.axis_aspect_ratio'] > upper) | (newdata['pr.axis_aspect_ratio'] < lower)) ].index,inplace=True)
# Taking care of max.length_aspect_ratio outliers
data_mean = newdata['max.length_aspect_ratio'].mean()
cutoff = newdata['max.length_aspect_ratio'].std()*3
lower ,upper = data_mean - cutoff , data_mean + cutoff
lower,upper
newdata[((newdata['max.length_aspect_ratio'] > upper) | (newdata['max.length_aspect_ratio'] < lower)) ]
#So these are the outliers for max.length_aspect_ratio which we have to remove
#Remove the outliers indexes
newdata.drop(newdata[((newdata['max.length_aspect_ratio'] > upper) | (newdata['max.length_aspect_ratio'] < lower)) ].index,inplace=True)
# Taking care of scaled_variance outliers
data_mean = newdata['scaled_variance'].mean()
cutoff = newdata['scaled_variance'].std()*3
lower ,upper = data_mean - cutoff , data_mean + cutoff
lower,upper
newdata[((newdata['scaled_variance'] > upper) | (newdata['scaled_variance'] < lower)) ]
#So these are the outliers for scaled_variance which we have to remove
#Remove the outliers indexes
newdata.drop(newdata[((newdata['scaled_variance'] > upper) | (newdata['scaled_variance'] < lower)) ].index,inplace=True)
# Taking care of scaled_variance.1 outliers
data_mean = newdata['scaled_variance.1'].mean()
cutoff = newdata['scaled_variance.1'].std()*3
lower ,upper = data_mean - cutoff , data_mean + cutoff
lower,upper
newdata[((newdata['scaled_variance.1'] > upper) | (newdata['scaled_variance.1'] < lower)) ]
#So these are the outliers for scaled_variance.1 which we have to remove
#Remove the outliers indexes
newdata.drop(newdata[((newdata['scaled_variance.1'] > upper) | (newdata['scaled_variance.1'] < lower)) ].index,inplace=True)
# Taking care of scaled_radius_of_gyration outliers
data_mean = newdata['scaled_radius_of_gyration'].mean()
cutoff = newdata['scaled_radius_of_gyration'].std()*3
lower ,upper = data_mean - cutoff , data_mean + cutoff
lower,upper
newdata[((newdata['scaled_radius_of_gyration'] > upper) | (newdata['scaled_radius_of_gyration'] < lower)) ]
# No rows fall outside the 3-std cutoff for scaled_radius_of_gyration, so there is nothing to remove
# We can also impute instead of dropping: the method below replaces outliers with the feature's median
def outlierTreatment(feature):
    # Compute the 3-std cutoff for this feature
    data_mean = newdata[feature].mean()
    cutoff = newdata[feature].std() * 3
    lower, upper = data_mean - cutoff, data_mean + cutoff
    # Instead of removing the outlier rows...
    # newdata.drop(newdata[(newdata[feature] > upper) | (newdata[feature] < lower)].index, inplace=True)
    # ...update the outliers with the feature's median
    newdata.loc[(newdata[feature] > upper) | (newdata[feature] < lower), feature] = newdata[feature].median()
newdata.boxplot(figsize=(20,8),rot=50)
count = 1
plt.figure(figsize=(15,15))
for feature in features:
    plt.subplot(6, 3, count)
    plt.tight_layout()
    plt.ylabel('Frequency Distribution')
    sns.distplot(newdata[feature])
    count = count + 1
plt.show()
From the graphs above we can see that the features follow a mix of distributions: some are roughly normal while others are skewed or multimodal
sns.pairplot(newdata[features])
#Lets check with heat map
plt.figure(figsize=(20,15))
sns.heatmap(newdata[features].corr(),annot=True)
# Lets check correlation values: a value close to 0 means no correlation, a value close to +/-1 means strong correlation
newdata[features].corr()
from scipy.stats import zscore
data_std = newdata[features].apply(lambda a:zscore(a))
data_std.head()
#Lets check correlation values, value close to 0 no correlation value close to (+/-) 1 has correlation
#data_std.corr()
#we can use below as well
cov_matrix = np.cov(data_std.T)
print('Covariance Matrix \n%s' % (cov_matrix))
eigenvalues, eigenvectors = np.linalg.eig(cov_matrix)
print('Eigen Vectors \n%s' %(eigenvectors))
print('\n Eigen Values \n%s' %(eigenvalues))
#Lets plot cumulative explained variance
from sklearn.decomposition import PCA
pca = PCA().fit(data_std) # this actually generates eigen vectors(Principal components) and eigen values
plt.plot(np.cumsum(pca.explained_variance_ratio_))
#Lets print the same
np.cumsum(pca.explained_variance_ratio_)
#Lets use 12 dimensions to build PCs , as these capture around 99.4 % of the information as we saw previously
pca = PCA(n_components=12)
pca.fit(data_std)
data_std_pca = pca.transform(data_std)
data_std_pca = pd.DataFrame(data_std_pca)
data_std_pca.head()
# so below is the transformed data using 12 principal components
# Lets pair plot to see the output now
sns.pairplot(data_std_pca,diag_kind='kde')
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data_std_pca, newdata[target],test_size=0.3,random_state=28)
print(x_train.shape,x_test.shape)
print(y_train.shape,y_test.shape)
def evaluateModel(model, name):
    # Predict here instead of relying on a global y_pred from an earlier cell
    y_pred = model.predict(x_test)
    print("###### Evaluate the model %s ###########" % (name))
    print("Training Accuracy score : \t %s " % (model.score(x_train, y_train)))
    print("Test Accuracy score : \t %s " % (model.score(x_test, y_test)))
    print("Confusion matrix \n\t %s" % (metrics.confusion_matrix(y_test, y_pred)))
    print("Classification report \n %s" % (metrics.classification_report(y_test, y_pred)))
model_nb = GaussianNB()
model_nb.fit(x_train,y_train)
y_pred = model_nb.predict(x_test)
model_nb.score(x_test,y_test)
#Lets evaluate the model
evaluateModel(model_nb, "Naive Bayes")
x_train,x_test,y_train,y_test = train_test_split(data_std, newdata[target],test_size=0.3,random_state=28)
model_nb = GaussianNB()
model_nb.fit(x_train,y_train)
y_pred = model_nb.predict(x_test)
model_nb.score(x_test,y_test)
#Lets evaluate the model
evaluateModel(model_nb, "Naive Bayes")
x_train,x_test,y_train,y_test = train_test_split(data_std_pca, newdata[target],test_size=0.3,random_state=28)
model_svm = SVC(kernel='rbf')
model_svm.fit(x_train,y_train)
y_pred = model_svm.predict(x_test)
model_svm.score(x_test,y_test)
#Lets evaluate the model
evaluateModel(model_svm, "Support Vector Classifier")
x_train,x_test,y_train,y_test = train_test_split(data_std, newdata[target],test_size=0.3,random_state=28)
model_svm = SVC(kernel='rbf')
model_svm.fit(x_train,y_train)
y_pred = model_svm.predict(x_test)
model_svm.score(x_test,y_test)
#Lets evaluate the model
evaluateModel(model_svm, "Support Vector Classifier")
Summary
- As part of data preprocessing we took a glance over the number of independent features and the number of overall records (rows) available in the dataset
- We found that some of the independent features had missing values, so we replaced all missing values with the median of the corresponding feature
- We also checked the statistical summary of the features using the describe() method and noticed that only a few features, such as scatter_ratio and scaled_variance, were skewed in nature; the rest looked roughly normally distributed
- We then checked for outliers using a boxplot, found that some features had outliers, used the 3-standard-deviation (empirical) rule to identify the rows to treat as outliers, and removed them from the data set (these rows could also be replaced with the feature's median value)
- We drew a histogram of each feature to understand the distributions and found them mixed in nature: a few features were normally distributed, while others were skewed or multimodal (having two or more KDE density peaks)
- We then used a pair plot to understand the relationships between independent variables and found that many features were correlated, either positively or negatively
- To cross-check, we used a heatmap to inspect the correlations
- Ideally only correlated features are considered for PCA; in this use case, however, we used the entire set of independent features (we could instead select only the features that showed correlation in the pair plot observations)
- Since PCA helps to increase the signal-to-noise ratio (it concentrates the available information into fewer components) and also reduces dimensionality, we used it for our analysis
- To start with, we standardized the complete set of independent features using z-scores
- We then generated the covariance/correlation matrix, along with the principal components (eigenvectors) and eigenvalues
- We plotted the cumulative explained variance of the principal components to understand how much variance each PC captures
- From the graph and the explained variance values we found that 12 principal components together explain about 99.4% of the variance
- We then used the 12 principal components to transform the dataset and stored the result in a new dataframe
- We used a pair plot to check the relationships between these principal components and saw no correlation between them (so PCA has done its job)
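The component-selection step described above can also be automated: scikit-learn's `PCA` accepts a float `n_components` between 0 and 1 and keeps just enough components to reach that fraction of explained variance, instead of hard-coding `n_components=12`. A minimal sketch on synthetic standardized data (the array shape and seed are arbitrary stand-ins, not the vehicle data):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic standardized data: 200 samples, 18 features (mirrors this dataset's width).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 18))

# Keep just enough components to explain at least 99.4% of the variance.
pca = PCA(n_components=0.994)
X_reduced = pca.fit_transform(X)
print(pca.n_components_, X_reduced.shape)
```

On the real, correlated vehicle features this would select far fewer components than on this isotropic toy data, which is exactly the point of running PCA on correlated inputs.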
Model building
- To build the models we chose the Naive Bayes and Support Vector Classifier algorithms
- ### Model building using Naive Bayes algorithm
Without principal components (original dataframe, standardized), the Naive Bayes algorithm gave an overall test score of 62%
With principal components (transformed data), the Naive Bayes algorithm gave a much better result, with an overall test score of 87%
- ### Model building using Support Vector Classifier algorithm
Without principal components (original dataframe, standardized), the Support Vector Classifier gave an overall test score of 95.9%
With principal components (transformed data), the Support Vector Classifier gave almost the same result, with an overall test score of 95.5%
Final statement
Principal components (PCA) will generally give good results when used in a model, since PCA increases the signal-to-noise ratio by concentrating the available information into fewer components. Because PCA works best on features that are positively or negatively correlated, the principal components may not yield good results if this is not taken into account (ideally, choose only features that have relationships between them to generate the PCs). In our use case we took all the features for PCA, and noticed that the Naive Bayes algorithm in particular performed much better on the principal components.
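As a closing note, the standardize → PCA → classify sequence used throughout this notebook is often packaged into a single scikit-learn `Pipeline`, which ensures each transformation is fitted on the training split only. A hedged sketch on synthetic stand-in data (the shapes, seed, and class labels below are illustrative, not the vehicle dataset):

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Stand-in for the vehicle data: 300 samples, 18 features, 3 well-separated classes.
rng = np.random.default_rng(28)
X = rng.normal(size=(300, 18)) + np.repeat(np.arange(3), 100)[:, None]
y = np.repeat(['bus', 'van', 'car'], 100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=28)

pipe = Pipeline([
    ('scale', StandardScaler()),    # z-score standardization, as done with zscore above
    ('pca', PCA(n_components=12)),  # same component count as used in this notebook
    ('svc', SVC(kernel='rbf')),
])
pipe.fit(X_train, y_train)
print('test accuracy:', pipe.score(X_test, y_test))
```

Fitting the scaler and PCA inside the pipeline avoids leaking test-set statistics into the transformation, which the cell-by-cell approach above does not guard against.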